View of Hoover Dam. Source: Wikimedia Commons
An investigation into whether a country’s hydroelectricity output is positively and directly related to GDP.
Chris Chae, Martin Hsu, Gillian Ippoliti, and Sydney Oberg
Do richer countries produce more hydroelectric power? Hydroelectric power is harnessed by using running water to turn a turbine, which drives a generator that converts the motion into electricity. Water has been used as a power source for millennia, but as technology has advanced, more sophisticated hydroelectric systems have been adopted as a more sustainable alternative to fossil fuels. As with many sustainable advancements, the cost of switching from fossil fuels to hydroelectric power can be a significant obstacle. This leads to the question of how a country’s GDP influences its use of hydroelectric power. Overall, have wealthier countries been quicker to adopt this sustainable power source? Our team investigated the connection between GDP and hydroelectric power using data from Gapminder, a global trends data source, covering 1959 to 2010.
In order to investigate this question, our team pulled and cleaned two datasets from gapminder.org – one on hydroelectricity production per person, and another on GDP per capita, by country and year.
We combined the datasets and created a linear model comparing the hydroelectricity production to GDP. The final dataset observations can be previewed below.
The following variables are included:
country - Country where datapoint was collected
year - Year in which data was collected from the country
gdp_person - GDP per capita, in 2010 inflation-adjusted (real) USD
hydro_person - Hydroelectric power produced per person, in tonnes of oil equivalent (toe)
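The combining step can be sketched as an inner join on country and year. The Python below is a minimal illustration, not necessarily the code used for this analysis. It assumes each dataset has already been reshaped into rows of (country, year, value). The function name `merge_by_country_year` and the sample values are hypothetical placeholders, not Gapminder figures.

```python
def merge_by_country_year(gdp_rows, hydro_rows):
    """Inner-join two lists of (country, year, value) tuples on (country, year)."""
    hydro_lookup = {(country, year): value for country, year, value in hydro_rows}
    merged = []
    for country, year, gdp in gdp_rows:
        key = (country, year)
        # Keep only country-years that appear in both datasets.
        if key in hydro_lookup:
            merged.append({
                "country": country,
                "year": year,
                "gdp_person": gdp,
                "hydro_person": hydro_lookup[key],
            })
    return merged


# Made-up placeholder values, used only to show the shape of the result.
gdp_rows = [("A", 1990, 1000.0), ("A", 1991, 1100.0), ("B", 1990, 20000.0)]
hydro_rows = [("A", 1990, 0.01), ("B", 1990, 0.25), ("B", 1991, 0.26)]

combined = merge_by_country_year(gdp_rows, hydro_rows)
for row in combined:
    print(row)
```

An inner join drops any country-year that is missing from either source, which is one reasonable choice when both variables are needed for every observation.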
Below is a visualization of the data that plots hydroelectric production per person against GDP per capita, pooling all years and countries. There is a positive association between the two variables, so higher GDP per capita may mean more hydroelectric power per person.
To investigate how this relationship has changed over time, we separated the data points by year and fit a regression for each year. We also plotted each year’s slope as a line graph to show how it changed over time. Both can be seen below. The slope is generally positive across years, which indicates a consistent positive relationship between GDP and hydroelectricity output per person. As seen in the regression plot and the plot of the slope over time, the relationship was most positive in the 1960s, and the slope has become less positive since then.
This means that the association between a country’s wealth and its hydropower output is weakening over time!
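The per-year regressions can be sketched as one least-squares slope per year. This is a minimal Python illustration under assumptions: rows are dictionaries with the `year`, `gdp_person`, and `hydro_person` fields described earlier, the helper names `ols_slope` and `slope_by_year` are hypothetical, and the toy rows are made up to show a slope that shrinks between two years.

```python
from collections import defaultdict
from statistics import mean


def ols_slope(xs, ys):
    """Least-squares slope of y on x (requires at least two distinct x values)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_by_year(rows):
    """Group rows by year and fit a separate slope for each year."""
    by_year = defaultdict(list)
    for row in rows:
        by_year[row["year"]].append((row["gdp_person"], row["hydro_person"]))
    return {
        year: ols_slope([x for x, _ in points], [y for _, y in points])
        for year, points in sorted(by_year.items())
    }


# Made-up toy rows: the later year has a flatter relationship.
rows = [
    {"year": 1965, "gdp_person": 1000.0, "hydro_person": 0.02},
    {"year": 1965, "gdp_person": 3000.0, "hydro_person": 0.06},
    {"year": 1965, "gdp_person": 5000.0, "hydro_person": 0.10},
    {"year": 2005, "gdp_person": 1000.0, "hydro_person": 0.02},
    {"year": 2005, "gdp_person": 3000.0, "hydro_person": 0.03},
    {"year": 2005, "gdp_person": 5000.0, "hydro_person": 0.04},
]

slopes = slope_by_year(rows)
for year, slope in slopes.items():
    print(year, slope)
```

Plotting the values of `slopes` against year would give a slope-over-time line graph of the kind described above.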
Based on the data and visualizations, we fit an overall linear model where hydroelectric production is the response and GDP is the predictor. The components can be seen in the table below. Of particular interest is the estimate column, highlighted in red. It shows the estimated intercept (from the (Intercept) row) and slope (from the gdp_person row).
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.0015511 | 0.0051454 | 0.3014634 | 0.7630772 |
| gdp_person | 0.0000086 | 0.0000003 | 31.4404317 | 0.0000000 |
Our final equation is therefore:
\[ (\operatorname{Hydroelectric\ Production\ per\ Person}) = \alpha + \beta_{1}(\operatorname{GDP\ per\ Capita}) + \epsilon \]
Where the intercept \(\alpha = 0.0015511\), the slope \(\beta_{1} = 0.0000086\) and \(\epsilon\) is the random error “noise.”
This can be interpreted as follows: for every US$1 increase in GDP per capita, the predicted hydroelectric output per person increases by about \(8.6 \times 10^{-6}\) tonnes of oil equivalent (toe). A country with a GDP per capita of 0 would have a predicted output of about 0.0016 toe per person.
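To make the interpretation concrete, the fitted line can be evaluated at a few GDP levels. The coefficients below are copied from the table above; the GDP inputs (10,000 and 40,000 USD) and the function name `predict_hydro_person` are hypothetical choices for illustration. Because the table rounds the slope to two significant figures, the predictions are approximate.

```python
INTERCEPT = 0.0015511  # (Intercept) estimate from the coefficient table
SLOPE = 0.0000086      # gdp_person estimate from the coefficient table (rounded)


def predict_hydro_person(gdp_person):
    """Predicted hydroelectric output per person (toe) for a given GDP per capita (USD)."""
    return INTERCEPT + SLOPE * gdp_person


for gdp in (0, 10_000, 40_000):
    print(gdp, round(predict_hydro_person(gdp), 4))
# 0 0.0016
# 10000 0.0876
# 40000 0.3456
```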
However, let’s go back to the visualization. The data does not seem very linear overall. It looks like a lot of countries have very low hydroelectric output across a wide range of GDP. The data seems to “fan” out as GDP increases and doesn’t clearly follow the regression line. As a result, before we can rely on the conclusions above, we have to assess whether the linear model is actually a good fit, based on the appropriate statistical indicators.
Below are the variances of the following model quantities:
hydro_person - The observed values of hydroelectric power per person,
.fitted - The fitted or predicted values of the model based on GDP per capita,
.resid - The residuals of the model, equal to the observed minus the predicted values.
The observed variance is the total variance, split between the fitted and residual variance.
| statistic | hydro_person | .fitted | .resid |
|---|---|---|---|
| variance | 0.0850756 | 0.0171156 | 0.06796 |
Most of the variance in hydroelectric production (about 0.068 of 0.085) sits in the residuals, which means it is unexplained by GDP per capita. This is not a good sign for the quality of our model.
A good indicator of model fit based on these values is RSquared. It answers the question “What percent of variability in hydroelectric output is explained by GDP per capita?” The closer RSquared is to 1, the better fit the GDP per capita model is, as it explains a higher percentage of the variability. The RSquared can be found in the table below, highlighted in red:
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.2011805 | 0.200977 | 0.2607247 | 988.5007 | 0 | 1 | -292.1429 | 590.2859 | 609.1128 | 266.8111 | 3925 | 3927 |
The RSquared is about 0.20, so GDP per capita accounts for about 20% of the variability in hydroelectric output per person across the pooled country-year observations. The fit is not as strong as many would prefer, but the correlation is still large enough to suggest a connection between richer countries and higher hydroelectric output. However, since only about 20% of the variability is explained by the regression, other variables likely affect which countries produce more hydroelectric power.
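The RSquared value can be cross-checked against the variance table, since it equals the fitted variance divided by the observed variance. The short Python check below uses only the numbers reported in the tables above; the variable names are illustrative.

```python
# Values copied from the variance table.
var_observed = 0.0850756  # variance of hydro_person
var_fitted = 0.0171156    # variance of .fitted
var_resid = 0.0679600     # variance of .resid

# The fitted and residual variances add up to the observed variance.
print(round(var_fitted + var_resid, 7))  # 0.0850756

# Two equivalent ways to compute RSquared from the decomposition.
r_squared = var_fitted / var_observed
r_squared_alt = 1 - var_resid / var_observed
print(round(r_squared, 4), round(r_squared_alt, 4))  # 0.2012 0.2012
```

Both routes agree with the r.squared value of 0.2011805 in the model summary table, up to rounding of the table entries.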
We have another way that we can assess the fit of the model. If we use our linear model to simulate hydroelectric output data points, and our simulated data looks similar to our observed data, then our regression is a good fit for the data.
As a result, we simulated hydroelectric output values based off our model, plotted the distribution of the data, and compared it to the distribution of the observed hydroelectric output.
The distributions are very different. The observed distribution of hydroelectric output per person is strongly right-skewed, while the simulated values are approximately normally distributed. This likely means that our model is not a good fit for the observed data.
To confirm this, we plotted the observed data against the simulated data, as seen below. The closer the simulated data is to the observed data, the more closely the points will fall along a line with slope 1. This line appears on the graph as a dashed red line.
The points in our plot of observed versus simulated hydroelectric output per person do not fall along the slope-1 line. This means that the model does not generate data that is similar to the observed data.
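One simulated dataset can be sketched as each fitted value plus normally distributed noise. This Python sketch assumes the noise uses the model’s residual standard error (`sigma` in the summary table), which is the usual approach but is not confirmed as the exact procedure used here. The GDP inputs are hypothetical, evenly spaced values rather than the Gapminder observations, and `simulate_hydro` is an illustrative name.

```python
import random

INTERCEPT = 0.0015511  # from the coefficient table
SLOPE = 0.0000086      # from the coefficient table (rounded)
SIGMA = 0.2607247      # residual standard error from the model summary table


def simulate_hydro(gdp_values, rng):
    """Draw one simulated hydro_person value per GDP value: fitted + normal noise."""
    return [INTERCEPT + SLOPE * gdp + rng.gauss(0.0, SIGMA) for gdp in gdp_values]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
gdp_values = [500.0 * k for k in range(1, 81)]  # hypothetical: 500 to 40,000 USD

simulated = simulate_hydro(gdp_values, rng)

# Normal noise is symmetric, so some simulated outputs fall below zero,
# which real hydroelectric production cannot do.
share_negative = sum(value < 0 for value in simulated) / len(simulated)
print(len(simulated), share_negative)
```

The symmetric noise is one reason a simulated distribution from a linear model looks bell-shaped while the observed, non-negative output is right-skewed.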
But what if our simulated values for that one trial just happen to not fit the observed data very well? In order to confirm that this single simulation is not just an unlikely outcome, we simulated this process 1000 times. This means we generated 1000 sets of simulated values and compared each of the simulated sets against the set of the observed data for 1000 comparisons in all.
The plot below shows the distribution of 1000 RSquareds, one for each simulated set compared against the observed data. The average RSquared across the simulations is between 0.04 and 0.05. This indicates, with a high degree of certainty, that very little of the variation in the observed data can be explained by simulated values from our model. Our model is not a good fit.
We also found the correlation coefficient, or r value, for each simulated comparison. The plot of simulated r values below confirms that the simulated values most likely have a weak positive correlation, or none, with the observed values.
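The repeated-simulation check can be sketched as a loop that simulates, then correlates each simulated set with the observed values. This Python illustration runs on a made-up, right-skewed toy dataset rather than the Gapminder data, so its numbers will not match the plots. The helper names `fit_line` and `pearson_r`, and the way the toy data is generated, are assumptions for illustration only.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares intercept and slope of y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)

# Toy "observed" data (hypothetical): non-negative, right-skewed, fanning out with GDP.
gdp = [rng.uniform(500, 40_000) for _ in range(200)]
hydro = [8.6e-6 * g * rng.expovariate(1.0) for g in gdp]

# Fit the linear model and estimate the residual standard error.
intercept, slope = fit_line(gdp, hydro)
fitted = [intercept + slope * g for g in gdp]
residuals = [y - f for y, f in zip(hydro, fitted)]
sigma = (sum(e * e for e in residuals) / (len(gdp) - 2)) ** 0.5

# Simulate 1000 datasets and compare each one with the observed values.
r_values = []
for _ in range(1000):
    simulated = [f + rng.gauss(0.0, sigma) for f in fitted]
    r_values.append(pearson_r(hydro, simulated))
r_squared_values = [r * r for r in r_values]

print(round(mean(r_values), 3), round(mean(r_squared_values), 3))
```

Under this setup, the correlation between observed and simulated values is expected to be roughly the model’s own RSquared, so its square lands near 0.20² ≈ 0.04 for a model like ours. That is in line with the range reported above.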
From our analysis, we can conclude that although the linear association is positive (higher GDP per capita goes with more hydroelectric production on average), there is little evidence of a strong, direct linear relationship between hydroelectricity production and gross domestic product. The scatterplot shows no consistent linear pattern: the majority of the points are clumped toward the bottom of the graph, with a smaller number of higher points pulling the regression line upward. The model and simulated RSquareds were very low. From that, we concluded that our model is not a good fit.
There may be other factors to explore that explain hydroelectric production better than just GDP. Exploring the relationship by year and by country may also yield interesting and more significant results.